Text File | 1992-05-23 | 15.6 KB | 314 lines

                        CHARACTER CODING OF JAPANESE

******************************************************************************
* This archive contains the kanji font file KDP16SJ.FNT, which is needed     *
* by the KDPLUS kanji preprocessor system. For those who would like to       *
* know how the font file is organized, the following notes have been         *
* provided which explain Japanese character coding.                          *
******************************************************************************

1) Starting point: the ku-ten table

All characters used in Japanese writing can be arranged in a table which is
called the "ku-ten" table. The table, which is universally used, is 94
columns wide and 94 rows high, but rows 85 and up are empty (not used) at
present. Numbering of rows and columns starts at 1 (not zero). Any character
can be identified by specifying its row number (called its "ku" value) and
its column number (called its "ten" value).

The symbols in rows 1-47 are called "level 1 JIS (Japanese Industrial
Standard) characters"; they are the most commonly used characters. Rows 48
and up are called "level 2 JIS". The level 1 kanji (from row 16) are
arranged according to pronunciation (normally the on-yomi) and stroke count.

A print-out of the "ku-ten" table can be found in the instruction manual of
every Japanese "wapro" (word processor) and every Japanese printer. In many
"wapros" the ku-ten values of characters may be entered by hand. "Office
Automation Dictionaries", available in Japan, enable you to look up the
"ku-ten value" of any character.

The "ku-ten" table is not completely standardized in Japan. The
standardization applies only to rows 1-8 (kana, alphanumerics) and rows 16
and up (kanji); they are defined in the standard JIS X 0208. Rows 9-15 are
left blank in the standard and can, apparently, be filled in by
manufacturers according to their own ideas. The blank areas in rows 1-8 are
considered "reserved".

The complete ku-ten table is contained in six files which go with this
archive (see section 6).
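
The row ranges given in section 1 can be summarized in a small lookup
function. This is only a sketch: the enum and the function name
kuten_area_of are invented here for illustration, and the boundaries are
simply the ones stated in the notes (rows 1-8 standardized non-kanji, rows
9-15 left to manufacturers, rows 16-47 level 1 kanji, rows 48-84 level 2,
rows 85-94 empty).

```c
/* Areas of the ku-ten table as described in section 1.
   All names here are illustrative; they are not part of KDPLUS. */
enum kuten_area {
    KUTEN_INVALID,            /* ku or ten outside 1..94              */
    KUTEN_STANDARD_NONKANJI,  /* rows 1-8: kana, alphanumerics        */
    KUTEN_VENDOR,             /* rows 9-15: not covered by standard   */
    KUTEN_LEVEL1_KANJI,       /* rows 16-47: level 1 kanji            */
    KUTEN_LEVEL2,             /* rows 48-84: level 2                  */
    KUTEN_EMPTY               /* rows 85-94: not used at present      */
};

/* Return the area that the position [ku,ten] falls into.
   Rows and columns are numbered from 1, as in the text. */
enum kuten_area kuten_area_of(int ku, int ten)
{
    if (ku < 1 || ku > 94 || ten < 1 || ten > 94)
        return KUTEN_INVALID;
    if (ku <= 8)
        return KUTEN_STANDARD_NONKANJI;
    if (ku <= 15)
        return KUTEN_VENDOR;
    if (ku <= 47)
        return KUTEN_LEVEL1_KANJI;
    if (ku <= 84)
        return KUTEN_LEVEL2;
    return KUTEN_EMPTY;
}
```

The function only reports which area a position belongs to; it does not
know whether a particular position inside an area is actually filled.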

2) Kanji fonts

A kanji font is a set of binary data (a ROM chip, or a disk file)
describing the actual appearance of the symbols. The file KDP16SJ.FNT is an
"almost" standard 16 x 16 pixel kanji font (see section 7 for a summary of
the changes which were made). It contains bitmap images of characters, each
bitmap 16 pixels wide and 16 pixels high; each bitmap therefore occupies 32
bytes.

The character bitmaps are arranged sequentially in the font file according
to the character's position in the ku-ten table. The offset (in bytes) of
the bitmap corresponding to character [ku,ten] is

    32*((ku-1)*94+ten-1).

The font file contains bit-maps for the first 83 rows of the ku-ten table
(row 85 and up are empty anyhow, and row 84 contains only 5 rarely-used
characters, so this is no great loss). The total number of character images
in the font is thus 94*83=7802.

The ku-ten table contains many gaps (incompletely filled rows). For
instance, in row 8 only the first 32 places are filled (with line draw
symbols), the rest is blank. Row 14 originally contained only 3 symbols
(but now we have added some IBM control characters to that row). The blank
areas are left blank in the font file; in other words, they are not
skipped, but are represented by bit-map tables which consist of zeroes.
This is, of course, a waste of space, but it makes for flexibility (you can
put your own symbols there if you wish) and easy decoding.

In the file KDP16SJ.FNT, the bitmap images in rows 9, 10, and 11 use only
the left-hand half of the 16 x 16 pixel box. They can be displayed with a
horizontal spacing of 8 pixels. 8-pixel, or half-character, symbols are
called hankaku; characters which use the full 16 x 16 box are called
zenkaku.
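
The offset formula for the 16 x 16 font can be turned into a small lookup
routine. The following is a minimal sketch, not code taken from KDPLUS: the
names glyph_offset and glyph_read are invented for illustration, the file
is assumed to have been opened in binary mode, and nothing is assumed about
how the 32 bytes of a bitmap are arranged internally, since these notes do
not say.

```c
#include <stdio.h>

#define KUTEN_COLS  94   /* columns per row of the ku-ten table     */
#define FONT_ROWS   83   /* rows present in the 16 x 16 font file   */
#define GLYPH_BYTES 32   /* 16 x 16 pixels, one bit per pixel       */

/* Byte offset of the bitmap for [ku,ten], or -1 if the position is
   outside the part of the table that the font file contains. */
long glyph_offset(int ku, int ten)
{
    if (ku < 1 || ku > FONT_ROWS || ten < 1 || ten > KUTEN_COLS)
        return -1L;
    return (long)GLYPH_BYTES * ((long)(ku - 1) * KUTEN_COLS + (ten - 1));
}

/* Read the 32-byte bitmap for [ku,ten] from an open font file.
   Returns 0 on success, -1 on a bad position or an I/O error. */
int glyph_read(FILE *fp, int ku, int ten, unsigned char out[GLYPH_BYTES])
{
    long offset = glyph_offset(ku, ten);

    if (offset < 0)
        return -1;
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    if (fread(out, 1, GLYPH_BYTES, fp) != GLYPH_BYTES)
        return -1;
    return 0;
}
```

Because blank positions are stored as all-zero bitmaps rather than skipped,
no index table is needed: the offset depends only on ku and ten. With 83
rows the last bitmap, [83,94], starts at offset 249632, and the whole file
is 7802*32 = 249664 bytes long.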

In a 24 x 24 pixel font, zenkaku characters are 24 pixels wide and hankaku
characters are 12 pixels wide. (In the font KDP24SJ.FNT, used by KPLJ24,
the hankaku characters are in fact 13 pixels wide; KPLJ24 inserts 2 empty
pixels between zenkaku characters to keep the zenkaku spacing twice the
size of the hankaku spacing.)

3) JIS coding

The number of columns in the ku-ten table, 94, is not arbitrary; it is
derived from the number of 7-bit ASCII characters. With 7 bits, 128
different characters can be represented; leaving out the characters 0 and
127, and also the characters 1-32 (control characters and space), we are
left with 94 printable characters, having the numerical values 33-126.

Any character in the ku-ten table can now be represented by 2 bytes:

    first byte :  "ku" value  + 32
    second byte:  "ten" value + 32

The first character in the ku-ten table, [ku=1, ten=1], is thus represented
by the two bytes [33,33]. The first kanji character in the table (the
character with pronunciation "A", meaning Asia), with ku=16 and ten=1, is
represented by the bytes [48,33], or, in ASCII, "0!".

Thus we have a system for transmitting Japanese characters on channels
which use 7-bit characters (especially mainframe systems). This is called
the JIS code.

The problem which now arises is this: a terminal capable of receiving kanji
data according to the system described above would interpret each character
as one half of a kanji. It could not receive normal ASCII text without
changing it into some garbled mess of kanji and kana. It would, of course,
be desirable if the same terminal could ALSO interpret ASCII characters
according to their normal meaning.

The solution which was adopted for this may be inelegant, but is
unavoidable within the limitations of the 7-bit format. It consists of
switching between two modes: "ASCII mode" and "kanji mode". The mode is
switched by means of an escape sequence.
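
The two-byte rule given earlier in this section (add 32 to "ku" and to
"ten") can be sketched in C as follows. The function names kuten_to_jis and
jis_to_kuten are invented for this illustration; the code handles only the
byte pair itself and does nothing about switching between ASCII mode and
kanji mode.

```c
/* Convert [ku,ten] (both 1..94) to the two 7-bit JIS bytes.
   Returns 0 on success, -1 if ku or ten is out of range.
   Function names are illustrative only. */
int kuten_to_jis(int ku, int ten, unsigned char out[2])
{
    if (ku < 1 || ku > 94 || ten < 1 || ten > 94)
        return -1;
    out[0] = (unsigned char)(ku + 32);    /* first byte:  33..126 */
    out[1] = (unsigned char)(ten + 32);   /* second byte: 33..126 */
    return 0;
}

/* Convert two JIS bytes back to [ku,ten].
   Returns 0 on success, -1 if either byte is outside 33..126. */
int jis_to_kuten(unsigned char b1, unsigned char b2, int *ku, int *ten)
{
    if (b1 < 33 || b1 > 126 || b2 < 33 || b2 > 126)
        return -1;
    *ku = b1 - 32;
    *ten = b2 - 32;
    return 0;
}
```

For [ku=16, ten=1] the first function produces the bytes 48 and 33, which
are the ASCII characters "0" and "!", matching the example in the text.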

JIS code systems need two escape sequences:

    kanji in (KI) sequence:   changes from ASCII mode to kanji mode
    kanji out (KO) sequence:  changes from kanji mode to ASCII mode

Of course, the disadvantage of this method is that the KI and KO strings
may become garbled in transmission, leaving the system in the wrong mode.
But I suppose a better solution wasn't possible in systems using only seven
bits.

KI and KO strings differ according to the "dialect" of the JIS code which
is in use. Three major dialects are "old JIS", "new JIS", and "NEC", which
have respectively:

               KI          KO
               =======     =======
    old JIS    ESC $ @     ESC ( H
    new JIS    ESC $ B     ESC ( J
    NEC        ESC K       ESC H (pica), ESC E (elite)

"Old JIS" is, for instance, used by JICST and the Nikkei Telecom News
database service. "New JIS" is used by the kanji editor program MOKE (by
Mark Edwards), and in the Japanese section of the GENIE network. NEC
printers use the NEC code.

Some JIS systems can also handle hankaku katakana characters. These
characters are encoded by one byte, with a value of 21-5F hex. To indicate
that such codes must be interpreted as hankaku katakana rather than normal
ASCII, hankaku katakana strings must be preceded and followed by special
codes: the character SO (0E hex) switches from ASCII to hankaku katakana;
the character SI (0F hex) switches from hankaku katakana to ASCII.

This system is used to communicate with the 7-bit, "old JIS" data-bank
JICST. You initiate a search by typing a keyword in ASCII or hankaku
katakana (JICST does not accept zenkaku characters for input). The response
from the system is in ASCII and "old JIS" zenkaku characters.

The default mode for JIS systems is ASCII mode.

4) EUC coding

EUC (Extended Unix Code) is a variant of JIS which is used on eight-bit
UNIX systems such as can be found in university environments. The coding
system is exactly the same as JIS, but the switch between ASCII mode and
kanji mode is not indicated by escape strings.
Instead, characters in kanji sequences have the high bit set, while ASCII
characters have the high bit cleared (zero).

5) SJIS coding

In bulletin board systems (which are always 8-bit), and frequently also for
internal character representation in Japanese personal computers, the
so-called SJIS code is used. SJIS means shift-JIS, probably to indicate
that "shifted" (high bit set) characters are used. They are used, however,
in a way which is very different from that of the EUC system.

There are three kinds of SJIS codes: controls, one-byte characters, and
two-byte characters.

Controls are represented by one byte, having the values 0-1F hex, or 0-31
decimal. Controls include codes for new line, carriage return, form feed,
back space, etc.

One-byte characters are represented by one byte having a value ranging from
20 to 7E hex (32 to 126 decimal) or from A0 to DF hex (160 to 223 decimal).
For values in the range 20 to 7E hex, the meaning of the characters is the
same as in standard ASCII. The range A1 to DF hex is used for hankaku
katakana; these values are the same as the JIS hankaku katakana, but with
the high bit set. On the IBM PC, this range is occupied by the "box draw"
characters. The value A0 hex represents a space (same as 20 hex).

A peculiarity is that on some systems (for instance the KDPLUS system) the
one-byte characters can also be coded with two bytes; this is the case when
the characters have been put somewhere in the non-standardized part of the
ku-ten table, so that they have a normal two-byte address. On some systems
(an example is the Ichitaro word-processing system on an AX) ASCII and
hankaku katakana are kept out of the ku-ten table altogether, so these
characters can only be selected with one-byte codes.

Two-byte characters are represented by a "high" byte followed by a "low"
byte.
In order not to be mistaken for a control or a one-byte character, the
"high" byte must use values which are not used by those characters, in the
ranges 81-9F hex and E0-EA hex. The "low" byte uses values in the range
40-FC hex, but the value 7F hex is skipped (not used). This may be a relic
from the paper tape era. On paper tape systems, "all holes punched" was
never used for a character, so that it was possible to erase characters on
the tape by overpunching them.

There are 188 possible values for the "low" byte and 42 for the "high"
byte. Every possible value of the "high" byte can now encode 2 rows (2 x 94
characters) of the "ku-ten" table. In total, therefore, 84 rows could be
encoded, but only one row is encoded for the characters with "high" byte
equal to EA hex.

The algorithm for converting "ku-ten" values to "high-low" values is:

    high = 0x80 + (ku + 1) / 2;     /* 2 ku values share the same high byte */
    if (high > 0x9F)
        high += 0x40;               /* if outside 81-9F range, lift to
                                       E0-EA range */
    if (ku & 1) {                   /* ku is odd */
        low = 0x3F + ten;
        if (low >= 0x7F)
            low++;                  /* skip the unused value 7F */
    }
    else
        low = 0x9E + ten;           /* ku is even */

The decoding algorithm is equally straightforward: assume that we have
already determined that a two-byte character has been sent, and we have the
"high" and "low" bytes available. We calculate the "ku" and "ten" values as
follows:

    if (high >= 0xE0)
        high -= 0x40;
    high -= 0x80;
    ku = 2 * high - 1;              /* always produces an odd value */
    if (low > 0x9E) {               /* ku is even: increase the value */
        ku++;
        ten = low - 0x9E;
    }
    else {                          /* ku is odd */
        if (low > 0x7F)
            low--;                  /* undo the skip over 7F */
        ten = low - 0x3F;
    }

The treatment of the one-byte characters depends on where the hankaku
characters are stored in the font, because this is hardly standardized. In
the font KDP16SJ.FNT, the hankaku ASCII characters are stored in row 9, and
hankaku katakana in row 10.
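
As a check on the two-byte conversion, the encoding and decoding algorithms
of this section can be wrapped as functions and run as a round trip over
every position the font contains. This is a sketch only: the function names
are invented here and are not part of KDPLUS, and the loop stops at row 83
because that is the last row encoded with "high" byte EA hex.

```c
/* ku-ten -> SJIS, following the encoding algorithm in the text.
   All function names are illustrative only. */
void kuten_to_sjis(int ku, int ten, int *high, int *low)
{
    *high = 0x80 + (ku + 1) / 2;     /* two rows share one high byte */
    if (*high > 0x9F)
        *high += 0x40;               /* lift into the E0-EA range    */
    if (ku & 1) {                    /* odd row                      */
        *low = 0x3F + ten;
        if (*low >= 0x7F)
            (*low)++;                /* skip the unused value 7F     */
    } else {                         /* even row                     */
        *low = 0x9E + ten;
    }
}

/* SJIS -> ku-ten, following the decoding algorithm in the text. */
void sjis_to_kuten(int high, int low, int *ku, int *ten)
{
    if (high >= 0xE0)
        high -= 0x40;
    high -= 0x80;
    *ku = 2 * high - 1;              /* odd row, corrected below     */
    if (low > 0x9E) {                /* even row                     */
        (*ku)++;
        *ten = low - 0x9E;
    } else {                         /* odd row                      */
        if (low > 0x7F)
            low--;                   /* undo the skip over 7F        */
        *ten = low - 0x3F;
    }
}

/* Byte ranges as stated in the text. */
static int valid_high(int h)
{
    return (h >= 0x81 && h <= 0x9F) || (h >= 0xE0 && h <= 0xEA);
}

static int valid_low(int l)
{
    return l >= 0x40 && l <= 0xFC && l != 0x7F;
}

/* Encode and decode every position in rows 1-83; return the number
   of positions that fail to round-trip or leave the byte ranges. */
int sjis_roundtrip_failures(void)
{
    int ku, ten, high, low, ku2, ten2, failures = 0;

    for (ku = 1; ku <= 83; ku++) {
        for (ten = 1; ten <= 94; ten++) {
            kuten_to_sjis(ku, ten, &high, &low);
            sjis_to_kuten(high, low, &ku2, &ten2);
            if (ku2 != ku || ten2 != ten ||
                !valid_high(high) || !valid_low(low))
                failures++;
        }
    }
    return failures;
}
```

For the first kanji, [ku=16, ten=1], kuten_to_sjis gives high = 88 hex and
low = 9F hex; the first position of the table, [1,1], gives 81 hex and 40
hex.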

With the hankaku ASCII characters in row 9 and the hankaku katakana in row
10, "ku" and "ten" for one-byte characters are calculated as follows:

    if (ch < 0x20) {                            /* control character */
        /* ....put appropriate code here.... */
    }
    else if ((ch == 0x20) || (ch == 0xA0)) {    /* hankaku space */
        /* The separate treatment of the hankaku space is necessary,
           because, inconveniently, the hankaku ASCII row in the font
           file does not start with a space, but with the exclamation
           mark (ASCII 0x21). We get the space from row 11, which does
           start with a space. */
        ku = 11;
        ten = 1;
    }
    else if ((ch > 0x20) && (ch <= 0x7E)) {     /* ASCII */
        ku = 9;
        ten = ch - 0x20;
    }
    else if ((ch > 0xA0) && (ch <= 0xDF)) {     /* hankaku katakana */
        ku = 10;
        ten = ch - 0xA0;
    }
    else {
        /* not a one-byte character, but first half of two-byte
           character. */
        /* ....put appropriate code here.... */
    }

Of course many tricks can be applied to make the code more compact and
faster. The separate treatment of the hankaku space can also be avoided
with a small trick. The above explanation shows the principle, however.

It is quite easy to make your program recognize KI and KO strings, and
switch automatically between SJIS and JIS coding. It is not so easy to
distinguish automatically between SJIS and EUC (at least not on the basis
of single characters).

6) "Ku-ten" table files

You can obtain a print-out of the ku-ten tables for Level 1 and Level 2 JIS
by printing the files:

    level1.1  level1.2  level1.3    (for Level 1 JIS)
    level2.1  level2.2  level2.3    (for Level 2 JIS)

Because the ku-ten tables are too wide to be printed on one sheet, they
have been split into three parts, covering the columns 1-32, 33-64, and
65-94 respectively. You can print all three of them on a Japanese printer
or word processing system, or on a "Western" printer using the print
utilities of the KDPLUS system. Glue the tables together to get complete
ku-ten tables.

The tables are SJIS coded. To convert to JIS, use the KDPLUS SJIS2JIS
utility.

7) Changes made in KDP16SJ.FNT

A few changes have been made in KDP16SJ.FNT to adapt it for use with KDPLUS
and the KDPLUS editor, JWRITE.
The most important of those changes is that the IBM control code symbols
(corresponding to ASCII values below 32) have been added to row 14 of the
font, from position 11, and the IBM characters with values EB-FE hex are in
that same row from position 75.

Furthermore, the character ASCII 92 (5C hex), corresponding to ku=9,
ten=60, is now displayed as a backslash, to make it conform to normal IBM
PC usage (in the original font, as on most Japanese computer systems, this
character is a "yen" sign).

Also, some cosmetic changes have been made in the "tilde", apostrophe,
reverse apostrophe, and quotation mark symbols, to make them usable as
accents. The "equals" sign (=) has also been slightly modified. In
combination with the capital Y, it makes a nice "yen" sign (through the
accent facility of JWRITE), should you need it.

If you don't like these changes, you can undo them using the font editor
KFEDIT that comes with KDPLUS.

Tokyo, 10 July 1991
(revised 14 January 1992, 16 February 1992, 20 May 1992)

Jan W. Stumpel